\input acphdr
\runninglefthead{ARITHMETIC---FIRST PROOFS $\copyright$ 1978}
\runningrighthead{ARITHMETIC---FIRST PROOFS $\copyright$ 1978}
\section{4.x}
\setcount0 701
\tenpoint
REDOING PAGES 286ff!\par\vfill\eject\noindent
we may calculate the sequences $a↓k$, $d↓k$, $u↓k$,
$v↓k$ defined by the rules
$$\baselineskip15pt
\eqalign{a↓0 ⊗= e,\cr
d↓0⊗=b↓0e,\cr
u↓0⊗=u,\cr
v↓0⊗=b↓0u,\cr}\qquad\eqalign{a↓k ⊗= 2a↓{k-1} \mod f;\cr
d↓k ⊗= (d↓{k-1} + b↓{k\,}a↓k)\mod f;\cr
u↓k ⊗= (u↓{k-1} + 2↑{a↓{k-1}}u↓{k-1})\mod(2↑f-1);\cr
v↓k ⊗= (v↓{k-1} + b↓{k\,}2↑{d↓{k-1}}u↓k)\mod(2↑f-1).\cr}\eqno(30)$$
It is easy to prove by induction on $k$ that
$$\eqalign{a↓k ⊗= (2↑ke)\mod f;\cr
d↓k⊗=\biglp(b↓k\ldotsm b↓1b↓0)↓2\,e\bigrp\mod f;\cr}\qquad\eqalign{
u↓k⊗=(c[2↑{k\,}]u)\mod(2↑f-1);\cr
v↓k ⊗= \biglp c[(b↓k \ldotsm b↓1b↓0)↓2]u\bigrp \mod(2↑f - 1).\cr}\eqno (31)$$
Hence the desired result, $(c[b]u)\mod(2↑f - 1)$,
is $v↓s$. The calculation of $a↓k$, $d↓k$, $u↓k$, $v↓k$ from $a↓{k-1}$,
$d↓{k-1}$, $u↓{k-1}$, $v↓{k-1}$ takes $O(\log f)+O(\log f)+O(f)+O(f)=O(f)$
cycles, and therefore the entire calculation can be done in $s\,O(f)=O(f\log f)$
cycles as desired.
The reader will find it instructive to study the ingenious method
represented by (30) and (31) very carefully. Similar techniques
are discussed in Section 4.6.3.
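Readers who like to experiment can trace (30) with the following Python sketch
(variable names illustrative only). The operator $c$ is defined on an earlier
page; the closed form $c[b]u=u(1+2↑e+\cdots+2↑{(b-1)e})$ used in the final
check is the one consistent with (30) and (31):

def cyclic_sequences(e, f, u, b_bits):
    # Carry out the rules (30); b_bits = (b_0, b_1, ..., b_s).
    m = (1 << f) - 1                        # the modulus 2^f - 1
    a = e % f                               # a_0 = e
    d = (b_bits[0] * e) % f                 # d_0 = b_0 e
    uu = u % m                              # u_0 = u
    v = (b_bits[0] * u) % m                 # v_0 = b_0 u
    for k in range(1, len(b_bits)):
        u_new = (uu + (1 << a) * uu) % m    # u_k uses a_{k-1}
        a = (2 * a) % f                     # a_k = 2 a_{k-1} mod f
        d_new = (d + b_bits[k] * a) % f     # d_k uses a_k
        v = (v + b_bits[k] * (1 << d) * u_new) % m  # v_k uses d_{k-1} and u_k
        d, uu = d_new, u_new
    return v                                # = v_s, which (31) identifies

# Check against the closed form for c[b]u, with b = (1101)_2 = 13:
e, f, u = 3, 11, 0b10110111011
direct = sum(u << (j * e) for j in range(13)) % ((1 << f) - 1)
assert cyclic_sequences(e, f, u, (1, 0, 1, 1)) == direct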
%folio 385 galley 14 (C) Addison-Wesley 1978,1979 *
\def\vchop#1{\vbox to 0pt{\vskip 0pt minus 100pt\hbox{#1}}}
Sch\"onhage's paper [{\sl Computing \bf 1} (1966), 182--196]
shows that these ideas can be extended to the multiplication
of $n$-bit numbers using \vchop{$r \approx 2↑{\sqrt{\chop to 0pt{\scriptstyle
2\lg n}}}$} moduli, obtaining
a method analogous to Algorithm C. We shall not dwell on the
details here, since Algorithm C is always superior; in fact,
an even better method is next on our agenda.
\subsectionbegin{C. Use of finite Fourier transforms} The
critical problem in high-precision multiplication is the determination
of ``convolution products'' such as
$$u↓rv↓0 + u↓{r-1}v↓1 + \cdots + u↓0v↓r,$$
and there is an intimate relation between
convolutions and an important mathematical concept called ``Fourier transformation.''
If $\omega =\exp(2πi/K)$
is a $K$th root of unity, the one-dimensional Fourier transform
of the sequence $(u↓0, u↓1, \ldotss , u↓{K-1})$ is defined to be the sequence $(
{\A u}↓0,{\A u}↓1, \ldotss , {\A u}↓{\!K-1})$, where
$$\chop to 12pt{{\A u}↓s = \sum ↓{0≤t<K}\omega ↑{st}u↓t,\qquad 0≤s<K.}\eqno(32)$$
Letting $({\A v}↓0, {\A v}↓1, \ldotss ,
{\A v}↓{K-1})$ be defined in the same way, as the Fourier transform of $(v↓0,
v↓1, \ldotss , v↓{K-1})$, it is not difficult to see that $(
{\A u}↓0{\A v}↓0, {\A u}↓1{\A v}↓1, \ldotss , {\A u}↓{\!K-1}
{\A v}↓{\!K-1})$ is the transform of $(w↓0, w↓1, \ldotss , w↓{K-1})$,
where
$$\eqalign{w↓r⊗ = u↓rv↓0 + u↓{r-1}v↓1 + \cdots + u↓0v↓r + u↓{K-1}v↓{r+1}
+ \cdots + u↓{r+1}v↓{K-1}\cr
\noalign{\vskip 7pt}
⊗ =\sum ↓{i+j≡r\,\modulo K}u↓iv↓j.\cr}$$
When $K ≥ 2n-1$ and $u↓n = u↓{n+1} = \cdots = u↓{K-1}
= v↓n = v↓{n+1} = \cdots = v↓{K-1} = 0$, the $w$'s are just what we
need for multiplication, since the terms
$u↓{K-1}v↓{r+1}+\cdots+
u↓{r+1}v↓{K-1}$ vanish when $0≤r≤2n-2$. In other words,
{\sl the transform of a convolution
product is the ordinary product of the transforms.} This idea
is actually a special case of Toom's use of polynomials $\biglp$cf.\ (10)$\bigrp
$, with $x$ replaced by roots of unity.
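The italicized statement is easy to check numerically. Here is a small Python
sketch that evaluates the defining sum (32) directly (an $O(K↑2)$ computation,
for illustration only) and tests the claim on zero-padded sequences:

import cmath

def transform(seq):
    # Finite Fourier transform (32): û_s = sum over t of ω^{st} u_t.
    K = len(seq)
    w = cmath.exp(2j * cmath.pi / K)
    return [sum(w**(s * t) * seq[t] for t in range(K)) for s in range(K)]

u = [3, 1, 4, 1, 5, 0, 0, 0]            # zero-padded so that K >= 2n - 1
v = [2, 7, 1, 8, 2, 0, 0, 0]
K = len(u)
w_conv = [sum(u[i] * v[(r - i) % K] for i in range(K)) for r in range(K)]
prod = [a * b for a, b in zip(transform(u), transform(v))]
assert all(abs(a - b) < 1e-6 for a, b in zip(transform(w_conv), prod))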
If $K$ is a power of 2, the finite Fourier transform (32) can be obtained
quite rapidly when the computations are arranged in a certain way, and so can the
inverse transform (determining the $w$'s from the $\A w$'s). This
property of Fourier transforms was exploited by V.
Strassen in 1968, who discovered how to multiply large numbers faster
than was possible under all previously known schemes.
He and A. Sch\"onhage later refined the method and published improved
procedures in {\sl Computing \bf 7} (1971), 281--292. In order to understand their
approach to the problem, let us first take a look at the mechanism of fast
Fourier transforms.
Given a sequence of $K=2↑k$ complex numbers $(u↓0,\ldotss,u↓{K-1})$, and given
the complex
number$$\omega=\exp(2πi/K),$$ the sequence $(\A u↓0,\ldotss,\A u↓{K-1})$ defined in
(32) can be calculated rapidly by carrying out the following scheme.\xskip(In these
formulas the parameters $s↓j$ and $t↓j$ are either 0 or 1, so that each ``pass''
represents $2↑k$ computations.)
\yyskip\noindent Pass 0.\xskip Let $A↑{[0]}(t↓{k-1},
\ldotss , t↓0) = u↓t$, where $t = (t↓{k-1} \ldotsm t↓0)↓2$.
\yskip\noindent Pass 1.\xskip Set $A↑{[1]}(s↓{k-1},t↓{k-2}, \ldotss , t↓0)←$
\penalty1000\vskip2pt
\hbox to size{\hfill$ A↑{[0]}(0, t↓{k-2}, \ldotss , t↓0) +
\omega ↑{(s↓{k-1}0\ldotsm0)↓2} \cdot A↑{[0]}(1, t↓{k-2}, \ldotss
, t↓0)$.}
\yskip\noindent Pass 2.\xskip Set $A↑{[2]}(s↓{k-1},
s↓{k-2}, t↓{k-3}, \ldotss , t↓0)←$
\penalty1000\vskip2pt
\hbox to size{\hfill$ A↑{[1]}(s↓{k-1}, 0, t↓{k-3}, \ldotss
, t↓0) + \omega ↑{(s↓{k-2}s↓{k-1}0 \ldotsm 0)↓2} \cdot A↑{[1]}(s↓{k-1},
1, t↓{k-3}, \ldotss , t↓0)$.}
\yskip$\ldots$
\yskip\noindent Pass $k$.\xskip Set $A↑{[k]}(s↓{k-1},
\ldotss , s↓1, s↓0)←$
\penalty1000\vskip2pt
\hbox to size{\hfill$ A↑{[k-1]}(s↓{k-1}, \ldotss , s↓1, 0)
+ \omega ↑{(s↓0s↓1 \ldotsm s↓{k-1})↓2} \cdot A↑{[k-1]}(s↓{k-1},
\ldotss , s↓1, 1)$.}
\yyskip\noindent It is fairly easy to prove by induction that
we have
$$\twoline{A↑{[j]}(s↓{k-1}, \ldotss , s↓{k-j}, t↓{k-j-1}, \ldotss
, t↓0)}{7pt}{\null=\chop to 9pt{\sum ↓{0≤t↓{k-1}, \ldotss , t↓{k-j}≤1}\omega
↑{(s↓0s↓1 \ldotsm s↓{k-1})↓2 \cdot (t↓{k-1} \ldotsm t↓{k-j}0 \ldotsm
0)↓2}\, u↓t,\qquad(33)\hskip-1em}}$$
so that
$$A↑{[k]}(s↓{k-1}, \ldotss , s↓1, s↓0) = {\A u}↓s,\qquad
\hbox{where $s = (s↓0s↓1 \ldotsm s↓{k-1})↓2$.}\eqno(34)$$
(Note the reversed order of the binary digits in
$s$. For further discussion of transforms such as this, see
Section 4.6.4.)
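A literal transcription of Passes 0 through $k$ may help in following the index
manipulations. In this Python sketch the quantities $A↑{[j]}$ are kept in a
dictionary indexed by the bit vector $(s↓{k-1},\ldotss,s↓{k-j},t↓{k-j-1},
\ldotss,t↓0)$, the exponents are built exactly as displayed above, and the last
line applies the bit reversal (34); the final assertion compares the result
with the defining sum (32):

import cmath
from itertools import product

def fft_passes(u):
    K = len(u)
    k = K.bit_length() - 1
    w = cmath.exp(2j * cmath.pi / K)
    bits = lambda t: tuple((t >> (k - 1 - i)) & 1 for i in range(k))
    A = {bits(t): u[t] for t in range(K)}                # Pass 0
    for j in range(1, k + 1):                            # Pass j
        B = {}
        for s in product((0, 1), repeat=j - 1):          # s_{k-1},...,s_{k-j+1}
            for t in product((0, 1), repeat=k - j):      # t_{k-j-1},...,t_0
                for sj in (0, 1):                        # the new bit s_{k-j}
                    e_bits = (sj,) + tuple(reversed(s))  # (s_{k-j}...s_{k-1},
                    e = sum(b << (k - 1 - i)             #  then k-j zeros)_2
                            for i, b in enumerate(e_bits))
                    B[s + (sj,) + t] = A[s + (0,) + t] + w**e * A[s + (1,) + t]
        A = B
    # (34): û_s is stored at (s_{k-1},...,s_0), where s = (s_0 s_1...s_{k-1})_2
    return [A[bits(s)[::-1]] for s in range(K)]

u = [complex(x) for x in (3, 1, 4, 1, 5, 9, 2, 6)]
w8 = cmath.exp(2j * cmath.pi / 8)
direct = [sum(w8**(s * t) * u[t] for t in range(8)) for s in range(8)]
assert all(abs(a - b) < 1e-9 for a, b in zip(fft_passes(u), direct))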
\penalty-200 % good place for a break (March 7, 1979)
To get the inverse Fourier transform $(u↓0,\ldotss,u↓{K-1})$ from the values of
$(\A u↓0,\ldotss,\A u↓{K-1})$, we may note that the ``double transform'' is
$$\twoline{\hskip1pt \A{\A{\hskip-1pt u}}↓r=
\sum↓{0≤s<K}\omega↑{rs}\A u↓s=\sum↓{0≤s,t<K}\omega↑{rs}\omega
↑{st}u↓t}{-6pt}{=\sum↓{0≤t<K}u↓t\,\,\bigglp\sum↓{0≤s<K}\omega↑{s(t+r)}\biggrp
=Ku↓{\,(-r)\mod K},}$$ since the geometric series $\sum↓{0≤s<K}\omega↑{sj}$ sums to
zero unless $j$ is a multiple of $K$. Therefore the inverse transform can be
computed in the same way as the transform itself, except that the final results
must be divided by $K$ and shuffled slightly.
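In code, reusing fft_passes from the sketch above (any routine that computes
(32) would serve):

def inverse_transform(a):
    # Transform again, divide by K, and reverse the indices modulo K,
    # exactly as the double-transform identity prescribes.
    K = len(a)
    b = fft_passes(a)
    return [b[(-r) % K] / K for r in range(K)]

u = [complex(x) for x in (3, 1, 4, 1, 5, 9, 2, 6)]
back = inverse_transform(fft_passes(u))
assert all(abs(a - b) < 1e-9 for a, b in zip(back, u))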
Applying this to the problem of integer multiplication, suppose we wish to compute
the product of two $n$-bit integers $u$ and $v$. As in Algorithm C we shall work
with groups of bits; let
$$2n≤2↑k\,l<4n,\qquad K=2↑k,\qquad L=2↑l,\eqno(35)$$
and write
$$u=(U↓{K-1}\ldotsm U↓1U↓0)↓L,\qquad v=(V↓{K-1}\ldotsm V↓1V↓0)↓L,\eqno(36)$$
regarding $u$ and $v$ as $K$-place numbers in radix $L$ so that each digit $U↓j$
or $V↓j$ is an $l$-bit integer. Actually the leading digits $U↓j$ and $V↓j$ are
zero for all $j≥K/2$, because $2↑{k-1}l≥n$. We will select appropriate values for
$k$ and $l$ later; at the moment our goal is to see what happens in general,
so that we can choose $k$ and $l$ intelligently when all the facts are before us.
The next step of the multiplication procedure is to find the Fourier transforms
$(\A u↓0,\ldotss,\A u↓{K-1})$ and $(\A v↓0,\ldotss,\A v↓{K-1})$ of the
respective sequences $(u↓0,\ldotss,u↓{K-1})$ and $(v↓0,\ldotss,v↓{K-1})$, where
we define
$$u↓t=U↓t/2↑{k+l},\qquad v↓t=V↓t/2↑{k+l}.\eqno(37)$$
This scaling is done for convenience so that the absolute values $|u↓t|$ and
$|v↓t|$ are less than $2↑{-k}$, ensuring that $|\A u↓s|$ and $|\A v↓s|$ will be
less than 1 for all $s$.
An obvious problem arises here, since the complex number $\omega$ can't be
represented exactly in binary notation. How are we going to compute a reliable
Fourier transform? By a stroke of good luck, it turns out that everything
will work properly if we do the calculations with only a modest amount of
precision. For the moment let us bypass this question and assume that
infinite-precision calculations are being performed; we shall analyze later how
much accuracy is actually needed.
Once the $\A u↓s$ and $\A v↓s$ have been determined, we let $\A w↓s=\A u↓s \A v↓s$
for $0≤s<K$ and determine the inverse Fourier transform $(w↓0,\ldotss,w↓{K-1})$.
As explained above, we now have
$$w↓r=\sum↓{i+j=r}u↓iv↓j = \sum↓{i+j=r}U↓iV↓j/2↑{2k+2l},$$
so the integers $W↓r=2↑{2k+2l}w↓r$ are the coefficients in the desired product
$$u\cdot v=W↓{K-2\,}L↑{K-2}+\cdots+W↓1L+W↓0.\eqno(38)$$
Since $0≤W↓r<(r+1)L↑2<KL↑2$, each $W↓r$ has at most $k+2l$ bits, so it will not be
difficult to compute the binary representation when the $W$'s are known unless
$k$ is large compared to $l$.
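The entire pipeline (35)--(38) can be traced in a few lines of Python. The
sketch below substitutes a direct $O(K↑2)$ transform for the fast one, chooses
$k=l$ merely to satisfy $2n≤2↑k\,l$, and inverts with $\omega↑{-st}$ in place
of the double-transform trick (cf.\ the answer to exercise 9); it illustrates
the data flow, not the speed. Here $n$ is an upper bound on the bit length of
each operand:

import cmath

def fft_multiply(u, v, n):
    k = l = 1
    while (1 << k) * l < 2 * n:            # ensure 2n <= 2^k l, as in (35)
        k, l = k + 1, l + 1
    K, L = 1 << k, 1 << l
    w = cmath.exp(2j * cmath.pi / K)
    def digits(x):                          # radix-L digits, scaled as in (37)
        return [((x >> (l * t)) & (L - 1)) / (1 << (k + l)) for t in range(K)]
    def transform(a, sign):
        return [sum(w**(sign * s * t) * a[t] for t in range(K))
                for s in range(K)]
    uh, vh = transform(digits(u), 1), transform(digits(v), 1)
    wr = [x / K for x in transform([a * b for a, b in zip(uh, vh)], -1)]
    W = [round((z * (1 << (2 * k + 2 * l))).real) for z in wr]   # W_r of (38)
    return sum(Wr << (l * r) for r, Wr in enumerate(W))

assert fft_multiply(12345, 67890, 17) == 12345 * 67890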
Let us try to estimate how much time this method takes, if $m$-bit fixed-point
arithmetic is used in calculating the Fourier transforms. Exercise 10 shows
that all of the quantities $A↑{[j]}$ during all the passes of the transform
calculations will be less than 1 in magnitude because of the scaling (37), hence
it suffices to deal with $m$-bit fractions $(.a↓{-1}\ldotsm a↓{-m})↓2$ for the
real and imaginary parts of all the intermediate quantities. Simplifications are
possible because the inputs $u↓t$ and $v↓t$ are real-valued: only $K$ real
values instead of $2K$ need to be carried in each step (see exercise 4.6.4--14).
We will ignore such refinements in order to keep the discussion simple.
The first job is to compute $\omega$ and its powers. For simplicity we shall
make a table of the values $\omega↑0$, $\ldotss$, $\omega↑{K-1}$. Let$$\omega↓r=
\exp(2πi/2↑r),$$so that $\omega↓1=-1$, $\omega↓2=i$, $\omega↓3=(1+i)/\sqrt2$,
$\ldotss$, $\omega↓k=\omega$. If $\omega↓r=x↓r+iy↓r$, we have
$$\omega↓{r+1}=\sqrt{1+x↓r\over2}+i\sqrt{1-x↓r\over2}.\eqno(39)$$
The calculation of $\omega↓1$, $\omega↓2$, $\ldotss$, $\omega↓k$ takes negligible
time compared with the other computations we need, so we can use any straightforward
algorithm for square roots. Once the $\omega↓r$ have been calculated we can
compute all of the powers $\omega↑j$ by starting with $\omega↑0=1$ and using the
following idea for $j>0$: If $j=2↑{k-r}\cdot q$ where $q$ is odd, and if $j↓0=
2↑{k-r}\cdot(q-1)$, we have
$$\omega↑j=\omega↑{j↓0}\cdot\omega↓r.\eqno(40)$$
This method of calculation keeps errors from propagating, since each $\omega↑j$
is a product of at most $k$ of the $\omega↓r$'s. The total time to calculate all
the $\omega↑j$ is $O(KM)$, where $M$ is the time to do an $m$-bit complex
multiplication; this is less time than the subsequent steps will require, so we
can ignore it.
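In Python the table construction looks as follows (double precision standing
in for $m$-bit fixed point; names illustrative):

import cmath, math

def omega_table(k):
    K = 1 << k
    w_r = [None, complex(-1, 0)]                 # ω_1 = -1
    for _ in range(k - 1):
        x = w_r[-1].real
        w_r.append(complex(math.sqrt((1 + x) / 2),
                           math.sqrt((1 - x) / 2)))     # the recurrence (39)
    pw = [complex(1, 0)] * K                     # ω^0 = 1
    for jj in range(1, K):
        r = k - (jj & -jj).bit_length() + 1      # jj = 2^{k-r} q with q odd
        pw[jj] = pw[jj & (jj - 1)] * w_r[r]      # (40), j_0 = jj - 2^{k-r}
    return pw

pw = omega_table(5)
assert all(abs(pw[t] - cmath.exp(2j * math.pi * t / 32)) < 1e-12
           for t in range(32))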
Each of the three Fourier transformations comprises $k$ passes, each of which
involves $K$ operations of the form $a←b+\omega↑j c$, so the total time to calculate
the Fourier transforms is $$O(kKM)=O(Mnk/l).$$Finally, the work involved in
computing the binary digits of $u\cdot v$ using (38) is $O\biglp K(k+l)\bigrp
=O(n+nk/l)$. Summing over all operations, we find that the total time to multiply
$n$-bit numbers $u$ and $v$ will be $O(n)+O(Mnk/l)$.
Now let's see how large the intermediate precision $m$ needs to be, so that we
know how large $M$ needs to be. For simplicity we shall be content with safe
estimates of the accuracy, instead of finding the best possible bounds. It
will be convenient to compute all the $\omega↑j$ so that our approximations
$(\omega↑j)↑\prime$ will satisfy $|(\omega↑j)↑\prime|≤1$; this condition is easy
to guarantee if we truncate towards zero instead of rounding. Let us assume that
our approximations $\omega↓r↑\prime$ to $\omega↓r$ can be written $\omega↓r↑\prime=
\omega↓r(1+\delta↓r)$ and that the multiplication (40) introduces an additional
error factor of $1+\delta↑{(j)}$, where the complex numbers $\delta↓r$ and $\delta
↑{(j)}$ satisfy
$$|\delta↓r|≤\delta\quad\hbox{and}\quad|\delta↑{(j)}|≤\delta,\qquad
\delta=|2↑{-m}+2↑{-m}\,i|=2↑{1/2-m}.\eqno(41)$$
Then $|(\omega↑j)↑\prime-\omega↑j|≤(1+\delta)↑{2k-2}-1$, and this is less than
$(2k-1)\delta$ if we make the reasonable assumption that $m≥k≥5$. The error
involved in each Fourier transform step can now be estimated as follows: If
$$|b↑\prime-b|≤ε,\qquad |c↑\prime-c|≤ε,\qquad |(\omega↑j)↑\prime|≤1,\qquad
|(\omega↑j)↑\prime-\omega↑j|<(2k-1)\delta,$$
then when we replace the exact computation $a←b+\omega↑jc$ by $$a↑\prime←
\hbox{round}\biglp b↑\prime+(\omega↑j)↑\prime c↑\prime\bigrp$$
we have the absolute error
bound$$|a↑\prime-a|<\delta+ε+ε+(2k-1)\delta=2ε+2k\delta.$$
If the absolute error at the time of ``Pass 0'' is bounded by $ε↓0$, the absolute
error after ``Pass $j$'' will be bounded by $$2↑jε↓0+(2↑j-1)\cdot2k\delta.$$
Therefore the computed
values of $\A u↓s$ will satisfy $|(\A u↓s)↑\prime-\A u↓s|<(2↑k-1)\cdot2k\delta$;
a similar formula will hold for $(\A v↓s)↑\prime$; and we will have $$|(\A w↓s)↑
\prime-\A w↓s|<2(2↑k-1)\cdot2k\delta+\delta.$$ During the inverse transformation
there is an additional accumulation of errors, but the division by $K=2↑k$
ameliorates most of this, so the computed values $w↓r↑\prime$ will satisfy
$$|w↓r↑\prime-w↓r|<4k2↑k\delta.$$We need enough precision to make
$2↑{2k+2l}w↓r↑\prime$ round to the correct integer $W↓r$, hence we need
$$2↑{2k+2l+2+\lg k+k+1/2-m}≤{1\over2},$$ i.e.,
$m≥3k+2l+\lg k+7/2$. This will hold if we simply require that
$$k≥7\qquad\hbox{and}\qquad m≥4k+2l.\eqno(42)$$
Relations (35) and (42) can be used to determine parameters $k$, $l$, $m$ so
that multiplication takes $O(n)+O(Mnk/l)$ units of time, where $M$ is the
time to multiply $m$-bit fractions.
If we are using \MIX, for example, suppose we want to multiply binary numbers
having $n=2↑{13}=8192$ bits each. We can choose $k=11$, $l=8$, $m=60$, so that
the necessary $m$-bit operations are nothing more than double precision arithmetic.
The running time $M$ needed to do fixed-point $m$-bit complex multiplication
will therefore be comparatively small. With triple-precision operations we
can go up to $k=16$, $l=13$, $n≤13\cdot2↑{15}$, which takes us way beyond
\MIX's memory capacity.
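These parameter choices are easy to audit mechanically; conditions (35) and
(42) are all that need checking, and $m=4k+2l$ holds with equality in both
examples:

def valid(n, k, l, m):
    # Conditions (35) and (42) for the proposed parameters.
    return 2 * n <= (1 << k) * l < 4 * n and k >= 7 and m >= 4 * k + 2 * l

assert valid(2**13, 11, 8, 60)          # the double-precision example
assert valid(13 * 2**15, 16, 13, 90)    # the triple-precision example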
Further study of the choice of $k$, $l$, and $m$ leads in fact to a rather
surprising conclusion:\xskip{\sl For all practical purposes we can assume that $M$ is
constant, and the Sch\"onhage-Strassen multiplication technique will have a
running time linearly proportional to $n$.}\xskip The reason is that we can choose
$k=l$ and $m=6k$; this choice of $k$ is always less than $\lg n$, so we will never
need to use more than sextuple precision unless $n$ is larger than the word size
of our computer.\xskip(In particular, $n$ would have to be larger than the
capacity of an index register used to address memory, so we probably couldn't
fit the numbers $u$ and $v$ in main memory.)
The practical problem of fast multiplication is therefore solved, except for
improvements in the constant factor. But our interest in multiplying large
numbers is partly theoretical, because it is interesting to explore the
ultimate limits of computational complexity. So let's forget practical
considerations and suppose that $n$ is extremely huge, perhaps much larger than
the number of atoms in the universe. We can let $m$ be approximately $6\lg n$, and
use the same algorithm recursively to do the $m$-bit multiplications. The
running time will satisfy $T(n)=O\biglp nT(\log n)\bigrp$; hence
$$T(n) ≤ C\,n(C\lg n)(C\lg\lg n)(C\lg\lg\lg n)\ldotss,$$
where the product continues until reaching a factor with $\lg\ldots\lg n≤1$.
Sch\"onhage and Strassen showed how to improve this theoretical upper bound to
$O(n\log n\log\log n)$ in their paper, by using {\sl integer} numbers
$\omega$ to carry out fast Fourier transforms on integers, modulo numbers of the
form $2↑e+1$. This upper bound applies to Turing machines, i.e., to computers with
bounded memory and a finite number of arbitrarily long tapes.
If we allow ourselves a more powerful computer, with random access to any number of
words of bounded size, Sch\"onhage has pointed out that the upper bound drops to
$O(n\log n)$. For we can choose $k=l$ and $m=6k$, and we have time to build a
complete multiplication table of all products $xy$ for
$0≤x,y<2↑{\lceil m/12\rceil}$.\xskip(The number of such products is $2↑k$ or
$2↑{k+1}$, and we can compute each table entry by addition from one of its
predecessors in $O(k)$ steps, hence $O(k2↑k)=O(n)$ steps will suffice for the
calculation.)\xskip In this case $M$ is the time needed to do 12-place arithmetic in
radix $2↑{\lceil m/12\rceil}$, and it follows that $M=O(k)=O(\log n)$ because
1-place multiplication can be done by table lookup.
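The table-building step is elementary; here is a Python sketch (illustrative
only, with a nested list standing in for random-access memory):

def build_product_table(c):
    # All products x*y for 0 <= x, y < 2^c; entry (x, y) comes from its
    # predecessor (x, y-1) by a single addition, as noted above.
    table = [[0] * (1 << c) for _ in range(1 << c)]
    for x in range(1 << c):
        for y in range(1, 1 << c):
            table[x][y] = table[x][y - 1] + x
    return table

table = build_product_table(6)          # c = ceil(m/12) in the text
assert table[37][41] == 37 * 41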
Sch\"onhage discovered in 1979 that a {\sl pointer machine} can carry out $n$-bit
multiplication in $O(n)$ steps; see exercise 12. Such devices (which are
also called ``storage modification machines'' and ``linking automata'') seem to
provide the best models of computation when $n→∞$, as discussed at the end of
Section 2.6. So we can conclude that multiplication in $O(n)$ steps is possible
for theoretical purposes as well as in practice.
\subsectionbegin{D. Division} Now that we have efficient routines for
multiplication, let's consider the inverse problem. It turns out that division
can be performed just as fast as multiplication, except for a constant factor.
To divide an $n$-bit number $u$ by an $n$-bit
number $v$, we may first find an $n$-bit approximation to $1/v$,
then multiply by $u$ to get an approximation $\A q$ to
$u/v$; finally, we can make the slight correction necessary
to $\A q$ to ensure that $0 ≤ u - qv < v$ by using another
multiplication. From this reasoning, we see that it suffices
to have an efficient algorithm for approximating the reciprocal of an
$n$-bit number. The following algorithm does this, using
``Newton's method'' as explained at the end of Section 4.3.1:
\algbegin Algorithm R (High-precision reciprocal).
Let $v$ have the binary representation $v = (0.v↓1v↓2v↓3\ldotsm
)↓2$, where $v↓1 = 1$. This algorithm computes an approximation
$z$ to $1/v$, such that
$$|z - 1/v| ≤ 2↑{-n}.\eqno(43)$$
\algstep R1. [Initial approximation.] Set
$z ← {1\over 4}\lfloor 32/(4v↓1 + 2v↓2 + v↓3)\rfloor$ and $k ← 0$.
\algstep R2. [Newtonian iteration.] (At this point we have a
number $z$ of the binary form $(xx.xx \ldotsm x)↓2$ with $2↑k
+ 1$ places after the radix point, and $z ≤ 2$.)\xskip Calculate $z↑2
= (xxx.xx \ldotsm x)↓2$ exactly, using a high-speed multiplication
routine. Then calculate $V↓{k\,}z↑2$ exactly, where $V↓k = (0.v↓1v↓2
\ldotsm v↓{2↑{k+1}+3})↓2$. Then set $z
← 2z - V↓{k\,}z↑2 + r$, where $0 ≤ r < 2↑{-2↑{k+1}-1}$
is added if necessary to ``round up'' $z$ so that it is a
multiple of $2↑{-2↑{k+1}-1}$. Finally,
set $k ← k + 1$.
\algstep R3. [Test for end.] If $2↑k < n$, go back to step R2;
otherwise the algorithm terminates.\quad\blackslug
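The algorithm can be animated with exact rational arithmetic. In the following
Python sketch (names illustrative) the Newton step of R2 is carried out
exactly, and the ``round up'' is a ceiling to a multiple of $2↑{-2↑{k+1}-1}$:

from fractions import Fraction

def algorithm_r(v_bits, n):
    # v = (0.v1 v2 v3 ...)_2 with v1 = 1, given as a string of bits;
    # returns z with |z - 1/v| <= 2^{-n}.
    def V(p):                                   # V = (0.v1 ... v_p)_2
        return Fraction(int((v_bits + '0' * p)[:p], 2), 1 << p)
    z = Fraction(32 // int((v_bits + '00')[:3], 2), 4)    # step R1
    k = 0
    while (1 << k) < n:                         # step R3
        p = (1 << (k + 1)) + 1                  # 2^{k+1}+1 binary places
        Vk = V((1 << (k + 1)) + 3)              # V_k of step R2
        z = 2 * z - Vk * z * z                  # exact Newton step
        z = Fraction(-((-z.numerator << p) // z.denominator), 1 << p)
        k += 1                                  # (the line above rounds up)
    return z

z = algorithm_r('110', 8)                       # v = 0.75
assert abs(z - Fraction(4, 3)) <= Fraction(1, 256)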
\yyskip This algorithm is based on
a method suggested by S. A. Cook. A similar
technique has been used in computer hardware [see Anderson,
Earle, Goldschmidt, and Powers, {\sl IBM J. Res.\ Dev.\ \bf 11}
(1967), 48--52]. Of course, it is necessary to check the accuracy
of Algorithm R quite carefully, because it comes very close
to being inaccurate. We will prove by induction that
$$z ≤ 2\qquad\hbox{and}\qquad |z - 1/v| ≤ 2↑{-2↑k}\eqno(44)$$
at the beginning and end of step R2.
For this purpose, let $\delta ↓k = 1/v - z↓k$, where
$z↓k$ is the value of $z$ after
$k$ iterations of step R2. To start the induction on $k$, we
have$$\delta ↓0 = 1/v - 8/v↑\prime + {(32/v↑\prime - \lfloor
32/v↑\prime \rfloor )/4} = \eta ↓1 + \eta ↓2,$$ where $v↑\prime
= (v↓1v↓2v↓3)↓2$ and $\eta ↓1 = (v↑\prime - 8v)/vv↑\prime$, so that
$-{1\over 2} < \eta ↓1 ≤ 0$ and $0 ≤ \eta ↓2 < {1\over 4}$.
Hence $|\delta ↓0| < {1\over 2}$. Now suppose that (44) has been
verified for $k$; then
$$\baselineskip14pt
\eqalign{\delta↓{k+1} = 1/v - z↓{k+1} ⊗= 1/v - z↓k - z↓k(1- z↓kV↓k) - r\cr
⊗ = \delta ↓k - z↓k(1 - z↓kv) - z↑{2}↓{k}(v - V↓k) - r\cr
⊗ = \delta ↓k - (1/v - \delta ↓k)v\delta ↓k - z↑{2}↓{k}(v - V↓k) - r\cr
⊗ = v\delta ↑{2}↓{k} - z↑{2}↓{k}(v - V↓k) - r.\cr}$$
Now
$$0 ≤ v\delta ↑{2}↓{k} < \delta ↑{2}↓{k} ≤ (2↑{-2↑k})↑2
= 2↑{-2↑{k+1}},$$
and
$$0 ≤ z↑2(v - V↓k) + r < 4(2↑{-2↑{k+1}-3})+2↑{-2↑{k+1}-1}=2↑{-2↑{k+1}},$$
so $|\delta ↓{k+1}| ≤ 2↑{-2↑{k+1}}$.
We must still verify the first inequality of (44); to show
that $z↓{k+1} ≤ 2$, there are three cases:\xskip (a) $V↓k = {1\over
2}$; then $z↓{k+1} = 2$.\xskip (b) $V↓k ≠ {1\over 2} = V↓{k-1}$; then
$z↓k = 2$, so $2z↓k - z↑{2}↓{k}V↓k ≤ 2 - 2↑{-2↑{k+1}-1}$.\xskip
(c) $V↓{k-1} ≠ {1\over 2}$; then $z↓{k+1}
= 1/v - \delta ↓{k+1} < 2 - 2↑{-2↑{k+1}}≤2$, since $k>0$.
The running time of Algorithm R is bounded by
$$\textstyle 2T(4n) + 2T(2n) + 2T(n) + 2T({1\over 2}n)+\cdots+O(n)$$
steps, where $T(n)$ is an upper bound on the time needed to do a multiplication
of $n$-bit numbers. If $T(n)$ has the form $nf(n)$ for some monotonically
nondecreasing function
$f(n)$, we have$$T(4n) + T(2n) + T(n) + \cdots < T(8n),$$so division can
be done with a speed comparable to that of multiplication except
for a constant factor.
R. P. Brent has shown that functions such as $\log x$, $\exp x$,
and $\mathop{\hbox{arctan}}x$
can be evaluated to $n$ significant bits in $O\biglp M(n)\log n\bigrp$
steps, if it takes $M(n)$ units of time to multiply $n$-bit numbers [{\sl
JACM \bf23} (1976), 242--251].
\subsectionbegin{E. An even faster multiplication method}
It is natural to wonder if multiplication of $n$-bit numbers
can be accomplished in just $n$ steps. We have come from order
$n↑2$ down to order $n$, so perhaps we can squeeze the
time down even more. In fact,
it is actually possible to output the answer as fast as we input the
digits, if we leave
the domain of conventional computer programming and allow ourselves
to build a computer that has an unlimited number of components
all acting at once.
A {\sl linear iterative array} of automata is a set of devices
$M↓1$, $M↓2$, $M↓3$, $\ldots$ that can each be in a finite set of
``states'' at each step of a computation. The machines $M↓2$,
$M↓3$, $\ldots$ all have {\sl identical} circuitry, and their state
at time $t + 1$ is a function of their own state at time $t$
as well as the states of their left and right neighbors at time
$t$. The first machine $M↓1$ is slightly different: its state
at time $t + 1$ is a function of its own state and that of $M↓2$,
at time $t$, and also of the {\sl input} at time $t$. The {\sl
output} of a linear iterative array is a function defined on
the states of $M↓1$.
Let $u = (u↓{n-1} \ldotsm u↓1u↓0)↓2$, $v =
(v↓{n-1} \ldotsm v↓1v↓0)↓2$, and $q = (q↓{n-1} \ldotsm q↓1q↓0)↓2$
be binary numbers, and let $uv + q = w = (w↓{2n-1} \ldotsm w↓1w↓0)↓2$.
It is a remarkable fact that a linear iterative array can be constructed,
independent of $n$, that will output $w↓0$, $w↓1$, $w↓2$, $\ldots$
at times 1, 2, 3, $\ldotss$, if it is given the inputs $(u↓0,
v↓0, q↓0)$, $(u↓1, v↓1, q↓1)$, $(u↓2, v↓2, q↓2)$, $\ldots$ at times
0, 1, 2, $\ldotss\,$.
We can state this phenomenon in the language of computer hardware,
by saying that it is possible to design a single ``integrated
circuit module'' with the following property: If we wire together
sufficiently many of these devices in a straight line, with
each module communicating only with its left and right neighbors,
the resulting circuitry will produce the $2n$-bit product of
$n$-bit numbers in exactly $2n$ clock pulses.
Here is the basic idea behind this construction: At time 0,
machine $M↓1$ senses $(u↓0, v↓0, q↓0)$ and it therefore is able to output
$(u↓0v↓0 + q↓0)\mod 2$ at time 1. Then it sees $(u↓1, v↓1, q↓1)$
and it can output $(u↓0v↓1 + u↓1v↓0 + q↓1 + k↓1)\mod 2$, where
$k↓1$ is the ``carry'' left over from the previous step, at
time 2. Next it sees $(u↓2, v↓2, q↓2)$ and outputs $(u↓0v↓2
+ u↓1v↓1 + u↓2v↓0 + q↓2 + k↓2)\mod 2$; furthermore, its state
records the values of $u↓2$ and $v↓2$ so that machine $M↓2$
will be able to sense these values at time 3, and $M↓2$ will
be able to compute $u↓2v↓2$ for the benefit of $M↓1$ at time
4. Machine $M↓1$ essentially arranges to start $M↓2$ multiplying the sequence
$(u↓2, v↓2)$, $(u↓3, v↓3)$, $\ldotss $, and $M↓2$ will ultimately
give $M↓3$ the job of multiplying $(u↓4, v↓4)$, $(u↓5, v↓5)$,
etc. Fortunately, things just work out so that no time is lost. The
reader will find it interesting to deduce further details from the formal
description that follows.
%folio 394 galley 17 (C) Addison-Wesley 1978 *
\topinsert{\vbox to 520pt{\hbox{(Table 1 will go on this page,
it's being set separately)}}}
Each automaton has $2↑{11}$
states$$(c, x↓0, y↓0, x↓1, y↓1, x, y, z↓2, z↓1, z↓0),$$where
$0 ≤ c < 4$ and each of the $x$'s, $y$'s, and $z$'s is either
0 or 1. Initially, all devices are in state $(0, 0, 0, 0, 0,
0, 0, 0, 0, 0)$. Suppose that a machine $M↓j$, for $j > 1$, is in state
$(c, x↓0, y↓0, x↓1, y↓1, x, y, z↓2, z↓1, z↓0)$ at time $t$,
and its left neighbor $M↓{j-1}$ is in state
$(c↑l,x↓0↑l,y↓0↑l,x↓1↑l,y↓1↑l,x↑l,y↑l,z↓2↑l,z↓1↑l,z↓0↑l)$
while its right neighbor $M↓{j+1}$ is in state
$(c↑r,x↓0↑r,y↓0↑r,x↓1↑r,y↓1↑r,x↑r,y↑r,z↓2↑r,z↓1↑r,z↓0↑r)$ at that time.
Then machine $M↓j$ will go into state
$(c↑\prime,x↓0↑\prime,y↓0↑\prime,x↓1↑\prime,y↓1↑\prime,x↑\prime,y↑\prime,
z↓2↑\prime,z↓1↑\prime,z↓0↑\prime)$ at time $t+1$, where
$$\baselineskip14pt
\vcenter{\halign{$\hfill#$⊗$\null#\hfill$\qquad⊗if
$\hfill#$⊗$\null#,\qquad$⊗$#\hfill$\ \ otherwise;\cr
c↑\prime=\min⊗(c+1,3)⊗c↑l⊗=3⊗0\cr
(x↓0↑\prime,y↓0↑\prime)⊗=(x↑l,y↑l)⊗c⊗=0⊗(x↓0,y↓0)\cr
(x↓1↑\prime,y↓1↑\prime)⊗=(x↑l,y↑l)⊗c⊗=1⊗(x↓1,y↓1)\cr
(x↑\prime,y↑\prime)⊗=(x↑l,y↑l)⊗c⊗≥2⊗(x,y)\cr}}\eqno(45)$$
and $(z↑{\prime}↓{2}z↑{\prime}↓{1}z↑{\prime}↓{0})↓2$
is the binary notation for
$$z↓0↑r+z↓1+z↓2↑l+\left\{\,\vcenter{\baselineskip14pt
\halign{$\dispstyle#$,\hfill\quad⊗if $c=#$\hfill\cr
x↑ly↑l⊗0;\cr
x↓0y↑l+x↑ly↓0⊗1;\cr
x↓0y↑l+x↓1y↓1+x↑ly↓0⊗2;\cr
x↓0y↑l+x↓1y+xy↓1+x↑ly↓0⊗3.\cr}}\right.\eqno(46)$$
The leftmost machine $M↓1$ behaves in almost
the same way as the others; it acts exactly as if there were
a machine to its left in state $(3, 0, 0, 0, 0, u, v, q, 0, 0)$
when it is receiving the inputs $(u, v, q)$. The output of the
array is the $z↓0$ component of $M↓1$.
Table 1 shows an example of this array acting
on the inputs $$u = v = (\ldotsm 00010111)↓2,\qquad q = (\ldotsm 00001011)↓2.$$
The output sequence appears in the lower right portion of the
states of $M↓1$:$$\hbox{0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, $\ldotss $,}$$ representing
the number $(\ldotsm 01000011100)↓2$ from right to left.
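Readers with a computer at hand can let the machine reconstruct Table 1: the
following Python sketch simulates the array defined by (45) and (46) and
reproduces the example just given. The state components are ordered as in the
text, and $\lfloor n/2\rfloor+2$ machines are provided, which is safe by
exercise 11:

def array_multiply(u, v, q, n):
    Z = (0,) * 10                       # (c, x0, y0, x1, y1, x, y, z2, z1, z0)
    S = [Z] * (n // 2 + 2)
    out = []
    for t in range(2 * n):
        ub, vb, qb = (u >> t) & 1, (v >> t) & 1, (q >> t) & 1
        left = (3, 0, 0, 0, 0, ub, vb, qb, 0, 0)   # virtual neighbor of M_1
        new = []
        for j, (c, x0, y0, x1, y1, x, y, z2, z1, z0) in enumerate(S):
            L = left if j == 0 else S[j - 1]
            R = S[j + 1] if j + 1 < len(S) else Z
            cl, xl, yl, z2l = L[0], L[5], L[6], L[7]
            cp = min(c + 1, 3) if cl == 3 else 0                 # (45)
            nx0, ny0 = (xl, yl) if c == 0 else (x0, y0)
            nx1, ny1 = (xl, yl) if c == 1 else (x1, y1)
            nx, ny = (xl, yl) if c >= 2 else (x, y)
            s = R[9] + z1 + z2l + [xl * yl,                      # (46)
                                   x0 * yl + xl * y0,
                                   x0 * yl + x1 * y1 + xl * y0,
                                   x0 * yl + x1 * y + x * y1 + xl * y0][c]
            new.append((cp, nx0, ny0, nx1, ny1, nx, ny,
                        (s >> 2) & 1, (s >> 1) & 1, s & 1))
        S = new
        out.append(S[0][9])             # z0 of M_1, the output at time t+1
    return sum(b << i for i, b in enumerate(out))

assert array_multiply(0b00010111, 0b00010111, 0b00001011, 8) == 23 * 23 + 11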
This construction is based on a similar one first published
by A. J. Atrubin, {\sl IEEE Trans. \bf EC--14} (1965),
394--399.\xskip S. Winograd [{\sl JACM \bf 14} (1967), 793--802] has
investigated the minimum multiplication time achievable in a
logical circuit when $n$ is given and when the inputs are available
all at once in coded form; see also C. S. Wallace, {\sl IEEE
Trans.\ \bf EC--13} (1964), 14--17.
\exbegin{EXERCISES}
\exno 1. [22] The idea
expressed in (2) can be generalized to the decimal system, if
the radix 2 is replaced by 10. Using this generalization, calculate
2718 times 4742 (reducing this product of four-digit numbers
to three products of two-digit numbers, and reducing each of
the latter to products of one-digit numbers).
\exno 2. [M22] Prove that, in step C1 of Algorithm C\null, the value
of $R$ either stays the same or increases by one when we set
$R ← \lfloor \sqrt{Q}\rfloor $.\xskip (Therefore,
as observed in that step, we need not calculate a square root.)
\exno 3. [M23] Prove that the sequences $q↓k$, $r↓k$ defined in
Algorithm C satisfy the inequality $2↑{q↓k+1}(2r↓k)↑{r↓k}
≤ 2↑{q↓{k-1}+q↓k}$, when $k > 0$.
\trexno 4. [28] (K. Baker.)\xskip Show
that it is advantageous to evaluate the polynomial $W(x)$ at
the points $x = -r$, $\ldotss$, 0, $\ldotss$, $r$ instead of at the nonnegative
points $x = 0$, 1, $\ldotss$, $2r$ as in Algorithm C\null. The polynomial
$U(x)$ can be written $$U(x) = U↓e(x↑2) + xU↓o(x↑2),$$ and similarly
$V(x)$ and $W(x)$ can be expanded in this way; show how to exploit
this idea, obtaining faster calculations in steps C7 and C8.
\trexno 5. [35] Show that
if in step C1 of Algorithm C we set $R ← \lceil
\sqrt{2Q}\,\rceil + 1$ instead of $R ← \lfloor
\sqrt{Q}\rfloor $, with suitable initial values of $q↓0$, $q↓1$,
$r↓0$, and $r↓1$, then (19) can be improved to $t↓k ≤ q↓{k+1}2↑
{\sqrt{2\lg q↓{k+1}}}\,(\lg q↓{k+1})$.
\exno 6. [M23] Prove that the six
numbers in (22) are relatively prime in pairs.
\exno 7. [M23] Prove (23).
\trexno 8. [25] Prove that it takes only $O(K\log K)$ arithmetic operations to
evaluate the finite Fourier transform (32), even when $K$ is not a power of 2.\xskip
[{\sl Hint:} Rewrite (32) in the form
$$\A u↓s=\omega↑{-s↑2/2}\sum↓{0≤t<K}\omega↑{(s+t)↑2/2}\omega↑{-t↑2/2}u↓t$$
and express this sum as a convolution product.]
\exno 9. [M15] Suppose the Fourier transformation method of the text is applied
with all occurrences of $\omega$ replaced by $\omega↑q$, where $q$ is some fixed
integer. Find a simple relation
between the numbers $(\s u↓0,\s u↓1,\ldotss,
\s u↓{K-1})$ obtained by this general procedure and the numbers $(\A u↓0,\A u↓1,
\ldotss,\A u↓{K-1})$ obtained when $q=1$.
\exno 10. [M26] The scaling in (37) makes it clear that all the complex numbers
$A↑{[j]}$
computed by pass $j$ of the transformation subroutine will be less than
$2↑{j-k}$ in absolute value, during the calculations of $\A u↓s$ and $\A v↓s$
in the Sch\"onhage-Strassen multiplication algorithm. Show that all of the
$A↑{[j]}$ will be less than 1 in absolute value during the {\sl third}
Fourier transformation (the calculation of $w↓r$).
\trexno 11. [M26] If $n$ is fixed, how many of the automata in
the linear iterative array (45), (46) are needed to compute
the product of $n$-bit numbers?\xskip (Note that the automaton $M↓j$
is influenced only by the component $z↑{r}↓{0}$ of the machine
on its right, so we may remove all automata whose $z↓0$ component
is always zero whenever the inputs are $n$-bit numbers.)
\trexno 12. [30] (A. Sch\"onhage.)\xskip The purpose of this exercise is to prove
that a simple form of pointer machine can multiply $n$-bit numbers in $O(n)$ steps.
The machine has no built-in facilities for arithmetic; all it does is work with
nodes and pointers. Each node has the same finite number of link fields,
and there are finitely many link registers. The only operations this machine can do
are:
\def\\(#1){\noindent\hbox to 30pt{\hfill #1) }\hangindent 30pt\!}
\yskip
\\(i) read one bit of input and jump if that bit is 0;
\\(ii) output 0 or 1;
\\(iii) load a register with the contents of another register or with
the contents of a link field in a node pointed to by a register;
\\(iv) store the contents of a register into a link field in a node pointed to by a
register;
\\(v) jump if two registers are equal;
\\(vi) create a new node and make a register point to it;
\\(vii) halt.
\yskip\noindent
Implement the Fourier-transform multiplication method efficiently on such a
ma\-chine.\xskip[{\sl Hints:} First show that if $N$ is any positive integer, it
is possible to create $N$ nodes representing the integers $\{0,1,\ldotss,N\}$,
where the node representing $p$ has pointers to the nodes representing $p+1$,
$\lfloor p/2\rfloor$, and $2p$. These nodes can be created in $O(N)$ steps.
Show that arithmetic with radix $N$ can now be simulated without difficulty:
for example, it takes $O(\log N)$ steps to find the node for $(p+q)\mod N$ and
to determine if $p+q≥N$, given pointers to $p$ and $q$; and multiplication
can be simulated in $O(\log N)↑2$ steps. Now consider the algorithm in the text,
with $k=l$ and $m=6k$ and $N=2↑{\lceil m/13\rceil}$, so that all quantities
in the fixed-point arithmetic calculations are 13-place integers with radix $N$.
Finally, show that each pass of the fast Fourier transformations can be done in
$O\biglp K+(N\log N)↑2\bigrp=O(K)$ steps, using the following idea: Each of
the $K$ necessary assignments can be ``compiled'' into a bounded list of
instructions for a simulated \MIX-like computer whose word size is $N$, and
instructions for $K$ such machines acting in parallel can be simulated in
$O\biglp K+(N\log N)↑2\bigrp$ steps if they are first sorted so that all
identical instructions are performed together.\xskip(Two instructions are identical
if they have the same operation code, the same register contents, and the
same memory operand contents.)\xskip
Note that $N↑2=O(n↑{12/13})$, so $(N\log N)↑2=O(K)$.]
\exno 13. [M25] (A. Sch\"onhage.)\xskip What is a good upper
bound on the time needed to multiply an $m$-bit number by an
$n$-bit number, when both $m$ and $n$ are very large but $n$
is much larger than $m$, based on the results proved in this
section for $m = n$?
\exno 14. [M42] Write a program for Algorithm C\null, incorporating
the improvements of exercise 4. Compare it with a program for
Algorithm 4.3.1M and with a program based on (2), to see how
large $n$ must be before Algorithm C is an improvement.
\exno 15. [M49] (S. A. Cook.)\xskip A multiplication algorithm is said to be
{\sl on line} if the $(k+1)$st input bits of the operands, from right to left,
are not read until the $k$th output bit has been produced. What are the fastest
possible on-line multiplication algorithms achievable on various species of
automata?
$\biglp$The best upper bound known is
$O\biglp n(\log n)↑2\log\log n\bigrp$, due to M. J. Fischer and L. J.
Stockmeyer [{\sl J. Comp.\ and Syst.\ Sci.\ \bf9} (1974), 317--331]; their
construction works on multitape Turing machines, hence also on pointer machines.
The best lower bound known is $O(n\log n/\!\log\log n)$, due to M. S. Paterson,
M. J. Fischer, and A. R. Meyer [{\sl SIAM/AMS Proceedings \bf7} (1974), 97--111],
applying to multitape Turing machines but not to pointer machines.$\bigrp$
\vfill
\eject
\setcount0 801
\runninglefthead{ANSWERS TO EXERCISES---FIRST PROOFS $\copyright$ 1978}
\runningrighthead{ANSWERS TO EXERCISES---FIRST PROOFS $\copyright$ 1978}
\section{4.x}
\ninepoint
Redoing the answer pages too...
\vfill\eject
This method reduces the size of the numbers being
handled, and reduces the number of additions and multiplications.
Its only disadvantage is a longer program (since the control
is somewhat more complex, and some of the calculations must
be done with signed numbers).
Another possibility would perhaps be
to evaluate $W↓e$ and $W↓o$ at $1↑2$, $2↑2$, $4↑2$, $\ldotss$, $(2↑r)↑2$;
although the numbers involved are larger, the calculations are
faster, since all multiplications are replaced by shifting and
all divisions are by binary numbers of the form $2↑j(2↑k - 1)$.\xskip
(Simple procedures are available for dividing by such numbers.)
\ansno 5. Start the $q, r$ sequences out with $q↓0$ and
$q↓1$ large enough so that the inequality in exercise 3 is valid.
Then we will find in the formulas analogous to those preceding
Theorem C that $\eta ↓1 → 0$ and $\eta ↓2 = \biglp 1 + 1/(2r↓k)\bigrp 2↑{1
+\sqrt{2Q↓k}-\sqrt{2Q↓{k+1}}}\,(Q↓k/Q↓{k+1})$. The factor $Q↓k/Q↓{k+1} → 1$
as $k → ∞$, so we can ignore it if we want to show that $\eta ↓2
< 1 - ε$ for all large $k$. Now $\sqrt{2Q↓{k+1}} =
\sqrt{2Q↓k + 2\lceil\sqrt{ 2Q↓k}\,\rceil + 2} ≥
\sqrt{(2Q↓k + 2\sqrt{2Q↓k} + 1) + 1} ≥ \sqrt{2Q↓k} + 1 + 1/(3R↓k)$. Hence $\eta
↓2 ≤ \biglp1 + 1/(2r↓k)\bigrp2↑{-1/(3R↓k)}$, and $\lg\eta ↓2 < 0$ for
large enough $k$.
{\sl Note:} Algorithm C can also be modified to
define a sequence $q↓0$, $q↓1$, $\ldots$ of a similar type that
is based on $n$, so that $n \approx q↓k + q↓{k+1}$ after step
C1. This modification leads to the estimate (19).
\ansno 6. Any common divisor of
$6q + d↓1$ and $6q + d↓2$ must also divide their difference
$d↓2 - d↓1$. The $6\choose2$ differences are 2, 3, 4, 6,
8, 1, 2, 4, 6, 1, 3, 5, 2, 4, 2, so we must only show that at
most one of the given numbers is divisible by each of the primes
2, 3, 5. Clearly only $6q + 2$ is even, and only $6q + 3$ is
a multiple of 3; and there is at most one multiple of 5, since
$q↓k \neqv 3\modulo 5$.
\ansno 7. $t↓k ≤ 6t↓{k-1} + ck3↑k$ for some constant $c$; so
$t↓k/6↑k ≤ t↓{k-1}/6↑{k-1} + ck/2↑k ≤ t↓0 + c \sum ↓{j≥1}(j/2↑j)
= M$. Thus $t↓k ≤ M \cdot 6↑k$.
\ansno 8. Let $2↑k$ be the smallest power of 2 that exceeds $2K$. Set $a↓t←\omega
↑{-t↑2/2}u↓t$ and $b↓t←\omega↑{(2K-2-t)↑2/2}$, where $u↓t=0$ for $t≥K$. We want
to calculate the convolutions $c↓r=\sum↓{0≤j≤r}a↓jb↓{r-j}$ for $r=2K-2-s$, when
$0≤s<K$. The convolutions can be found by using three fast Fourier transformations
of order $2↑k$, as in the text's multiplication procedure.\xskip[The origin of
this trick is unknown.]
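In Python the construction reads as follows; the convolution here is computed
directly, for checking purposes, whereas three power-of-two transforms as in
the text would make it fast:

import cmath

def chirp_transform(u):
    K = len(u)
    h = cmath.exp(1j * cmath.pi / K)        # h^2 = ω, so h^{t^2} = ω^{t^2/2}
    a = [h**(-t * t) * u[t] for t in range(K)]
    b = [h**((2 * K - 2 - t)**2) for t in range(2 * K - 1)]
    c = lambda r: sum(a[j] * b[r - j] for j in range(min(r, K - 1) + 1))
    return [h**(-s * s) * c(2 * K - 2 - s) for s in range(K)]

u = [1, 2, 3, 4, 5, 6]                      # K = 6, not a power of 2
w = cmath.exp(2j * cmath.pi / 6)
direct = [sum(w**(s * t) * u[t] for t in range(6)) for s in range(6)]
assert all(abs(x - y) < 1e-9 for x, y in zip(chirp_transform(u), direct))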
\ansno 9. $\s u↓s=\A u↓{(qs)\mod K}$. In particular, if $q=-1$ we get $\A u↓{(-
r)\mod K}$, which avoids shuffling when computing inverse transforms.
\ansno 10. $A↑{[j]}(s↓{k-1},\ldotss,s↓{k-j},t↓{k-j-1},\ldotss,t↓0)=\hskip-.6pt
\sum↓{0≤t↓{k-1},
\ldotss,t↓{k-j}≤1}\omega↑{(s↓0\ldotsm s↓{k-1})↓2\cdot(t↓{k-1}\ldotsm t↓{k-j}0\ldotsm
0)↓2}\*\biglp\sum↓p\omega↑{tp}u↓p\bigrp\biglp\sum↓q\omega↑{tq}v↓q\bigrp=
\sum↓{p,q}u↓pv↓qS(p,q)$, where $S(p,q)=0$ or $2↑j$. We have $S(p,q)=2↑j$
for exactly $2↑{2k}/2↑j$ values of $p$ and $q$.
\ansno 11. An automaton cannot have $z↓2 = 1$ until it has $c
≥ 2$, and this occurs first for $M↓j$ at time $3j - 1$. It follows
that $M↓j$ cannot have $z↓2z↓1z↓0 ≠ 000$ until time $3(j - 1)$.
Furthermore, if $M↓j$ has $z↓0 ≠ 0$ at time $t$, we cannot change
this to $z↓0 = 0$ without affecting the output; but the output
cannot be affected by this value of $z↓0$ until at least time
$t + j - 1$, so we must have $t + j - 1 ≤ 2n$. Since the first
argument we gave proves that $3(j - 1) ≤ t$, we must have $4(j
- 1) ≤ 2n$, that is, $j - 1 ≤ n/2$, i.e., $j ≤ \lfloor n/2\rfloor
+ 1$. This is the best possible bound,
since the inputs $u = v = 2↑n - 1$ require the use of $M↓j$
for all $j ≤ \lfloor n/2\rfloor + 1$.\xskip (For example, note from
Table 1 that $M↓2$ is needed to multiply two-bit numbers, at
time 3.)
\ansno 12. We can ``sweep through'' $K$ lists of \MIX-like instructions, executing
the first instruction on each list, in $O\biglp K+(N\log N)↑2$ steps as
follows:\xskip(1) A radix list sort (Section 5.2.5) will group together all
identical instructions, in time $O(K+M)$.\xskip(2) Each set
\eject % good place to break page (March 7, 1979)
of $j$ identical
instructions can be performed in $O(\log N)↑2+O(j)$ steps, and there are $O(N↑2)$
sets.
A bounded number of sweeps will finish all the lists. The remaining details are
straightforward; for example, arithmetic operations can be simulated by converting
$p$ and $q$ to binary.\xskip[To appear.]
\ansno 13. If it takes $T(n)$ steps to multiply
$n$-bit numbers, we can accomplish the multiplication of an $m$-bit
number by an $n$-bit number by breaking the latter into $\lceil n/m\rceil$
$m$-bit groups, using $\lceil n/m\rceil T(m) + O(n + m)$ operations.
The results of this section therefore give an estimated running
time of $O(n\log m\log\log m)$ on Turing machines, or $O(n\log m)$ on machines with
random access to words of bounded size, or $O(n)$ on pointer machines.